Tracing with OpenTelemetry #598

kamilkisiela · 2025-12-02T16:34:03Z

This pull request adds comprehensive OpenTelemetry‑based tracing.
It adds a set of GraphQL‑specific spans, connects the Router to Hive Console's Tracing and other OpenTelemetry exporters (via OTLP).

Goals of this Pull Request:

Introduce OpenTelemetry tracing
Integrate with Hive Console's Tracing
Support OTLP exporters (gRPC and HTTP)
Tracing propagation (trace context, baggage, jeager, b3/zipkin)

Things out of scope for this Pull Request:

No debug-level spans (we can add them when needed)
Customization and filtering of spans (future Pull Request)
Metrics (after the point above is done)
Logs (after the work on polishing the logs is done)

Code changes

Almost all code that is responsible for tracing and setting up OpenTelemetry is in internal crate's telemetry directory.
You will see there span definitions, exporters setup, Hive Console Tracing integration and things like that.

Every span created by Hive Router has hive.kind attribute that describes what kind of action the span represents. This is useful to filter spans without relying on span names, which can change over time.

Performance costs?

To minimize the performance impact of tracing to minimum, spans are only created when tracing is enabled. All the computation that is needed to create attributes is also only done when tracing is enabled. I spent some time making sure the overhead is zero or near-zero when tracing is disabled.

With tracing enabled, the overhead is still as low as I could make it, by avoiding allocations, using static strings for attributes names and avoiding complex computations when creating spans and attributes. There's of course some cost (TODO: measure it once again and get the numbers).

How spans are created and used?

Spans are created using tracing::span_info. Every "phase" has its own span and a struct that helps with creating and managing it. This is intentional, to store all span-related logic in one place and make it easy to understand what attributes are set on each span, and what methods are available to manipulate the span and not pollute the rest of the codebase with span-related logic. Imagine looking for the parsing span and the attributes if it's not centralized in one place.

For example, the parsing phase has GraphQLParseSpan struct that creates a span and has the record_cache_hit method to record whether the parsing was a cache hit or miss.

The tracing crate "requires" attributes to be static strings, and predefined at the span creation. This is for performance reasons.
That's why all the standard attributes are there from the beginning, that we later on fill in with data, like previously mentioned record_cache_hit method that fills in the cache.hit attribute.

You will notice that attributes names are defined as constants in the telemetry::attributes module, but not used when the spans are created. The reason for it is that tracing crate requires attributes to be static strings, so we can't use variables there.
I stored them as constants to avoid typos and have a single source of truth for attribute names, when we need to refer to them later on (like when to fill them in with data, or transforming them for many reasons). This gives us some kind of type safety.
To fillup the gap between static strings used for span creation and constants defined in the attributes module, I added a bunch of tests that verify that all attributes defined in the attributes module are actually used when creating spans. This way we can be sure that there are no typos and that all attributes are defined and match.

Spans are started and stopped automatically by either using the guard pattern (calling span.enter() returns a guard that stops the span when dropped) or instrumenting the futures with the span (using instrument method).
Spans are created by calling ::new() method on the span struct, which returns the struct instance. The span itself is stored inside the struct.

What exporters are supported?

Currently, two exporters are supported:

Hive Console Tracing exporter
OTLP exporter (gRPC and HTTP)

The OTLP exporter can be configured to send traces to any OTLP-compatible backend, like Jaeger, Zipkin, Honeycomb, NewRelic, Datadog and many others.

I tried to make everything configurable, so you can enable/disable specific exporters, set endpoints, headers and other options via environment variables or expressions.
Even the collection of spans can be configured, the batching options can be set, so users can tune the performance of the tracing to their needs.

Semantic conventions for HTTP

There's a standard for HTTP semantic conventions defined by OpenTelemetry for attributes, but not every tool supports it yet.
In Hive Router, we support both deprecated attributes and the stable ones, so that we can interoperate with more tools.

This is also configurable, so users can choose which set of attributes they want to use. We have three modes:

spec_compliant: use only stable attributes
deprecated: use only deprecated attributes
spec_and_deprecated: use both sets of attributes

How deduplicated requests are traced?

The http.inflight span is created for each outgoing HTTP request to subgraphs when request deduplication is enabled. Multiple concurrent identical requests will create separate http.inflight spans in different traces, but only the leader executes the HTTP call.

Leader role (hive.inflight.role="leader"):

First request to win the deduplication race
Creates http.client as a child span to execute the actual HTTP request
All joiners wait for this leader's result

Joiner role (hive.inflight.role="joiner"):

Subsequent identical concurrent requests
Does not create http.client (that span exists in the leader's trace)
Waits for the leader's HTTP response
References leader via hive.inflight.key (same fingerprint value)

Cross-trace correlation: All inflight spans (leader + joiners) for the same deduplicated request share the same hive.inflight.key value, allowing observability tools to correlate them across different traces. In future we may introduce a span link.

Leader trace (first/winning request):

GraphQL Subgraph Operation
  └─ http.inflight (role="leader", key=12345)
      └─ http.client  <- executes the actual HTTP request

Joiner trace (deduplicated request):

GraphQL Subgraph Operation
  └─ http.inflight (role="joiner", key=12345)
      └─ [no http.client - it's in the leader's trace]

Changes in Hive SDK

Hive Console's Tracing requires the operation hash to be sent as an attribute, and represent the same hash as in Hive SDK (Usage Reporting). I made changes to Hive SDK and exposed a function to compute the operation hash, and used it in Hive Router to set the hive.operation.hash attribute on the graphql.operation span.

I noticed that Hive SDK relies on graphql_parser to minifiy and normalize the query before hashing it, but it's super inefficient (we do Document to String conversions etc) that I hope to reimplement in a more efficient way in near future, but that requires forking and owning the graphql_parser crate.

github-actions · 2025-12-11T11:42:40Z

✅ `k6-benchmark` results

     ✓ response code was 200
     ✓ no graphql errors
     ✓ valid response structure

     █ setup

     checks.........................: 100.00% ✓ 237492      ✗ 0    
     data_received..................: 6.9 GB  230 MB/s
     data_sent......................: 93 MB   3.1 MB/s
     http_req_blocked...............: avg=3.3µs   min=661ns   med=1.71µs  max=27.5ms   p(90)=2.39µs  p(95)=2.76µs  
     http_req_connecting............: avg=418ns   min=0s      med=0s      max=1.71ms   p(90)=0s      p(95)=0s      
     http_req_duration..............: avg=18.47ms min=2ms     med=17.6ms  max=111.21ms p(90)=25.3ms  p(95)=28.41ms 
       { expected_response:true }...: avg=18.47ms min=2ms     med=17.6ms  max=111.21ms p(90)=25.3ms  p(95)=28.41ms 
     http_req_failed................: 0.00%   ✓ 0           ✗ 79184
     http_req_receiving.............: avg=132.1µs min=22.33µs med=38.57µs max=35.38ms  p(90)=80.14µs p(95)=369.01µs
     http_req_sending...............: avg=24.06µs min=5.45µs  med=10.56µs max=29.36ms  p(90)=15.85µs p(95)=27.41µs 
     http_req_tls_handshaking.......: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s      
     http_req_waiting...............: avg=18.31ms min=1.93ms  med=17.48ms max=84.44ms  p(90)=25.04ms p(95)=28.09ms 
     http_reqs......................: 79184   2634.220672/s
     iteration_duration.............: avg=18.94ms min=4.82ms  med=17.96ms max=234.5ms  p(90)=25.75ms p(95)=28.92ms 
     iterations.....................: 79164   2633.55533/s
     vus............................: 50      min=50        max=50 
     vus_max........................: 50      min=50        max=50

github-actions · 2025-12-11T11:45:53Z

🐋 This PR was built and pushed to the following Docker images:

Image Names: ghcr.io/graphql-hive/router

Platforms: linux/amd64,linux/arm64

Image Tags: ghcr.io/graphql-hive/router:pr-598 ghcr.io/graphql-hive/router:sha-b4a1be1

Docker metadata

{
"buildx.build.ref": "builder-4ca15e5d-8d3b-43dc-840e-1df07381fb89/builder-4ca15e5d-8d3b-43dc-840e-1df07381fb890/w5hxbm3j7m2zr8fqh60i743bc",
"containerimage.descriptor": {
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "digest": "sha256:8cd2eb3a39b51b3a9af502651c51716ab239505110aa2618cb4ce0dfb9a7665f",
  "size": 1609
},
"containerimage.digest": "sha256:8cd2eb3a39b51b3a9af502651c51716ab239505110aa2618cb4ce0dfb9a7665f",
"image.name": "ghcr.io/graphql-hive/router:pr-598,ghcr.io/graphql-hive/router:sha-b4a1be1"
}

kamilkisiela · 2026-01-08T16:04:33Z

/gemini review

gemini-code-assist

Code Review

This is an impressive and comprehensive pull request that introduces OpenTelemetry-based tracing to the router. The changes are well-structured, with clear separation of concerns in the new telemetry module. The addition of custom spans for GraphQL pipeline stages, HTTP requests, and the thoughtful implementation of context propagation and semantic convention compatibility are excellent.

My main feedback is a high-severity issue regarding the use of info_span! for instrumentation on hot paths, which goes against the repository's performance-first style guide. I've left a detailed comment with a suggestion to address this. Overall, this is a fantastic contribution that will significantly improve the observability of the router.

lib/internal/src/telemetry/traces/spans/graphql.rs

kamilkisiela · 2026-01-22T16:43:58Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces comprehensive OpenTelemetry-based tracing to the Hive Router, integrating with Hive Console and supporting OTLP exporters. The changes involve significant refactoring to centralize telemetry configuration, add GraphQL-specific spans, and ensure minimal performance overhead when tracing is disabled. The Cargo.lock file has been updated to reflect new dependencies, and several new modules have been added under lib/internal/src/telemetry to manage tracing logic, span attributes, and exporters. Overall, the changes are well-structured and align with the goal of providing deep visibility into the GraphQL request lifecycle.

gemini-code-assist · 2026-01-22T16:50:29Z

lib/internal/src/telemetry/traces/hive_console_exporter.rs

+                tracing::error!(
+                    { component = "hive_console_exporter", trace_id = trace_id.to_string() },
+                    "Span indexes out of bounds. Trace ignored.",
+                );
+                ignored_trace_ids.insert(trace_id);
+                continue;


The tracing::error! macro is used here to log when span indexes are out of bounds, leading to the trace being ignored. This is a critical error that should ideally not happen. Ensure that the indexing logic is robust to prevent this. Also, consider the performance implications of this error logging if it occurs frequently on a hot path, as per the repository style guide (line 20).

References

Hot-path budget ≈ near-zero: Guard everything non-trivial, minimize allocs, avoid dynamic dispatch and stringification in loops, keep data on the stack/arenas/Bytes where possible. ^(link)

gemini-code-assist · 2026-01-22T16:50:29Z

lib/executor/src/executors/http.rs

+            // TODO: let's decide at some point if the tracing headers
+            //       should be part of the fingerprint or not.
+            self.telemetry_context
+                .inject_context(&mut TraceHeaderInjector(req.headers_mut()));


The comment // TODO: let's decide at some point if the tracing headers // should be part of the fingerprint or not. indicates a pending decision that could impact the correctness or efficiency of deduplication. Tracing headers are part of the request context and could influence how a subgraph processes a request, even if the body is identical. If tracing headers are not part of the fingerprint, two logically distinct requests (from a tracing perspective) might be deduplicated, leading to incorrect trace correlation. This should be addressed to ensure correctness and maintainability.

gemini-code-assist · 2026-01-22T16:50:29Z

lib/internal/src/telemetry/traces/hive_console_exporter.rs

+                let trace_id = batch[graphql_idx].span_context.trace_id();
+                tracing::error!(
+                    { component = "hive_console_exporter", trace_id = trace_id.to_string() },
+                    "No matching 'http.client' or 'http.inflight' span found for 'graphql.subgraph.operation' span. Trace ignored.",
+                );
+                ignored_trace_ids.insert(trace_id);
+                continue;


The tracing::error! macro is used here to log when no matching http.client or http.inflight span is found for a graphql.subgraph.operation span, leading to the trace being ignored. This is a critical error in span correlation. Ensure that this logging itself does not become a performance bottleneck if this condition occurs frequently on a hot path. Consider if the error message can be made more concise to minimize string allocations, as per the repository style guide (line 20).

References

Hot-path budget ≈ near-zero: Guard everything non-trivial, minimize allocs, avoid dynamic dispatch and stringification in loops, keep data on the stack/arenas/Bytes where possible. ^(link)

lib/router-config/src/telemetry/hive.rs

lib/executor/src/executors/map.rs

gemini-code-assist · 2026-01-22T16:50:30Z

lib/internal/src/telemetry/traces/spans/http_request.rs

+    pub fn new(
+        method: &Method,
+        url: &Uri,
+        headers: &HeaderMap,
+        body_bytes: &[u8],
+        fingerprint: u64,
+    ) -> Self {
+        if !is_level_enabled(Level::INFO) {
+            return Self {
+                span: disabled_span(),
+            };
+        }
+
+        let (server_address, server_port) = match headers.get(HOST).and_then(|h| h.to_str().ok()) {
+            Some(host) => {
+                if let Some((host, port_str)) = host.rsplit_once(':') {
+                    (Some(host), port_str.parse::<u16>().ok())
+                } else {
+                    (Some(host), None)
+                }
+            }
+            None => (None, None),
+        };
+
+        let request_body_size = Some(body_bytes.len());
+        let request_method = Cow::Borrowed(method);
+        let header_user_agent = headers.get(USER_AGENT).map(Cow::Borrowed);
+        let url = Cow::Borrowed(url);
+        let protocol_version = version_to_protocol_version_attr(http::Version::HTTP_11);
+        let url_full = url.to_string();
+
+        // We follow the HTTP client span conventions:
+        // https://opentelemetry.io/docs/specs/semconv/http/http-spans/#http-client
+        let kind: &'static str = HiveSpanKind::HttpInflightRequest.into();
+        let span = info_span!(
+            target: TARGET_NAME,
+            "http.inflight",
+            "hive.kind" = kind,
+            "otel.status_code" = Empty,
+            "otel.kind" = "Internal",
+            "error.type" = Empty,
+
+            // Inflight Attributes
+            "hive.inflight.role" = Empty,
+            "hive.inflight.key" = fingerprint,
+
+            // Stable Attributes
+            "server.address" = server_address,
+            "server.port" = server_port,
+            "url.full" = url_full.as_str(),
+            "url.path" = url.path(),
+            "url.scheme" = url.scheme_str(),
+            "http.request.body.size" = request_body_size,
+            "http.request.method" = request_method.as_str(),
+            "network.protocol.version" = protocol_version,
+            "user_agent.original" = header_user_agent.as_ref().and_then(|v| v.to_str().ok()),
+            "http.response.status_code" = Empty,
+            "http.response.body.size" = Empty,
+        );


The HttpInflightRequestSpan::new function constructs a span with several attributes. Similar to the other HTTP span constructors, ensure that the extraction and formatting of these attributes (e.g., server_address, server_port, request_body_size, request_method, url_full, protocol_version) are as efficient as possible. Repeated string operations or complex parsing on a hot path can introduce overhead, as per the repository style guide (line 20).

References

Hot-path budget ≈ near-zero: Guard everything non-trivial, minimize allocs, avoid dynamic dispatch and stringification in loops, keep data on the stack/arenas/Bytes where possible. ^(link)

gemini-code-assist · 2026-01-22T16:50:30Z

lib/router-config/src/env_overrides.rs

+        if self.hive_access_token.is_some() && self.hive_target.is_some() {
+            config = config.set_override("telemetry.hive.usage.enabled", true)?;
+            config = config.set_override("telemetry.hive.tracing.enabled", true)?;
+        }


The logic here (if self.hive_access_token.is_some() && self.hive_target.is_some()) implies that both hive_access_token and hive_target must be present to enable both usage reporting and tracing. However, usage reporting and tracing might have different requirements or might be enabled independently. Consider if it's more flexible to enable them based on their individual presence rather than a combined check, or if this is an intentional design choice for simplicity.

intentional

gemini-code-assist · 2026-01-22T16:50:30Z

lib/internal/src/telemetry/traces/hive_console_exporter.rs

+                    tracing::debug!(
+                        { component = "hive_console_exporter" },
+                        "Attribute '{}' not found in 'http.server' span. Ignoring.",
+                        http_attr_key
+                    );


The tracing::debug! macro is used here to log when an attribute is not found in an http.server span. While this is a debug-level log, ensure that the string formatting and cloning (http_attr_key) are efficient. Repeated string operations in a loop, even at debug level, can add up if the loop runs frequently, as per the repository style guide (line 20).

References

Hot-path budget ≈ near-zero: Guard everything non-trivial, minimize allocs, avoid dynamic dispatch and stringification in loops, keep data on the stack/arenas/Bytes where possible. ^(link)

lib/router-config/src/telemetry/hive.rs

lib/router-config/src/telemetry/tracing.rs

Introduce an LRU cache to store trace IDs marked for dropping. This prevents re-processing the same trace IDs across multiple export batches, improving efficiency.

kamilkisiela force-pushed the kamil-tracing branch 2 times, most recently from d35c609 to 761d5d5 Compare December 11, 2025 11:36

kamilkisiela force-pushed the kamil-tracing branch from 761d5d5 to d470f76 Compare December 11, 2025 17:27

kamilkisiela force-pushed the kamil-tracing branch 2 times, most recently from 08b5890 to 5374314 Compare January 8, 2026 12:12

kamilkisiela marked this pull request as ready for review January 8, 2026 15:31

kamilkisiela changed the title ~~Telemetry~~ Tracing with OpenTelemetry Jan 8, 2026

gemini-code-assist bot reviewed Jan 8, 2026

View reviewed changes

lib/internal/src/telemetry/traces/spans/graphql.rs Show resolved Hide resolved

kamilkisiela force-pushed the kamil-tracing branch 5 times, most recently from 16534a4 to 6ab63db Compare January 16, 2026 12:23

kamilkisiela force-pushed the kamil-tracing branch from 9aec7ec to 9e78b9f Compare January 22, 2026 13:14

Tracing with OpenTelemetry

5a7a162

kamilkisiela force-pushed the kamil-tracing branch from 9e78b9f to 5a7a162 Compare January 22, 2026 16:33

test cache.hit attributes

2fb3028

gemini-code-assist bot reviewed Jan 22, 2026

View reviewed changes

kamilkisiela added 5 commits January 22, 2026 17:51

Switch tonic to tls-aws-lc and remove native roots

811de91

Prefer rustls for opentelemetry

5de5db4

usage -> usage_reporting

5cfd361

match capacity

a5a3010

Add moka cache for trace ID filtering

b9bd56e

Introduce an LRU cache to store trace IDs marked for dropping. This prevents re-processing the same trace IDs across multiple export batches, improving efficiency.

Tracing with OpenTelemetry #598

Are you sure you want to change the base?

Tracing with OpenTelemetry #598

Uh oh!

Conversation

kamilkisiela commented Dec 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code changes

Performance costs?

How spans are created and used?

What exporters are supported?

Semantic conventions for HTTP

How deduplicated requests are traced?

Changes in Hive SDK

Uh oh!

github-actions bot commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ k6-benchmark results

Uh oh!

github-actions bot commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kamilkisiela commented Jan 8, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

kamilkisiela commented Jan 22, 2026

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

kamilkisiela Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Jan 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kamilkisiela commented Dec 2, 2025 •

edited

Loading

github-actions bot commented Dec 11, 2025 •

edited

Loading

✅ `k6-benchmark` results

github-actions bot commented Dec 11, 2025 •

edited

Loading